Abstract

Unsupervised feature learning has shown impressive results for a wide range of in- 
put modalities, in particular for object classification tasks in computer vision. Us- 
ing a large amount of unlabeled data, unsupervised feature learning methods are 
utilized to construct high-level representations that are discriminative enough for 
subsequently trained supervised classification algorithms. However, it has not yet
been quantitatively investigated how well unsupervised learning methods can
find low-level representations for image patches without any additional supervision.
In this paper we examine the performance of purely unsupervised methods on
a low-level correspondence task, a problem that is central to many Computer Vi- 
sion applications. We find that a special type of Restricted Boltzmann Machines 
(RBMs) performs comparably to hand-crafted descriptors. Additionally, a simple 
binarization scheme produces compact representations that perform better than 
several state-of-the-art descriptors. 



1 Introduction 

In this paper we tackle a recent computer vision dataset [2] from the viewpoint of unsupervised 
feature learning. Why yet another dataset? There are already enough datasets that serve well for 
evaluating feature learning algorithms. In particular for feature learning from image data, several 
well-established benchmarks exist: Caltech-101 [10], CIFAR-10 [19], NORB [23], to name only a
few. Notably, these benchmarks are all object classification tasks. Unsupervised learning algorithms
are evaluated by considering how well a subsequent supervised classification algorithm performs
on high-level features that are found by aggregating the learned low-level representations [8]. We
think that mingling these steps makes it difficult to assess the quality of the unsupervised algorithms. 
A more direct way is needed to evaluate these methods, preferably where a subsequent supervised 
learning step is completely optional. 

We are not only at odds with the methodology of evaluating unsupervised learning algorithms. Gen- 
eral object classification tasks are always based on orientation- and scale-rectified pictures with 
objects or themes firmly centered in the middle. We are looking for a dataset where it is possible 
to show that unsupervised feature learning is beneficial to the wide range of Computer Vision tasks 
beyond object classification, like tracking, stereo vision, panoramic stitching or structure from motion.
One might argue that object classification acts as a good proxy for all these other tasks, but
this hypothesis has not been shown to be correct, either theoretically or through empirical evidence.
Instead, we chose the most general and direct task that can be applied to low-level representations:
matching these representations, i.e. determining if two data samples are similar given their learned
representation. 

Matching image descriptors is a central problem in Computer Vision, so hand-crafted descriptors 
are always evaluated with respect to this task [28]. Given a dataset of labeled correspondences,
supervised learning approaches will find representations and the accompanying distance metric that
are optimized with respect to the induced similarity measure. It is remarkable that hand-engineered
descriptors perform well under this task without the need to learn such a measure for their represen- 
tations in a supervised manner. 

To the best of our knowledge it has never been investigated whether any of the many unsupervised 
learning algorithms developed over the last couple of years can match this performance without 
relying on any supervision signals. While we propose an additional benchmark for unsupervised 
learning algorithms, we do not introduce a new learning algorithm. We rather investigate the performance
of the Gaussian RBM (GRBM) [39], its sparse variant (spGRBM) [29] and the mean
covariance RBM (mcRBM) [33] without any supervised learning with respect to the matching task.
As it turns out, the mcRBM performs comparably to hand-engineered feature descriptors. In fact,
using a simple heuristic, the mcRBM produces a compact binary descriptor that performs better than
several state-of-the-art hand-crafted descriptors. 

We begin with a brief description of the dataset used for evaluating the matching task, followed by 
a section on details of the training procedure. In section 4 we present our results, both quantitatively
and qualitatively, and also mention other models that were tested but not further analyzed because
of overall poor performance. Section 5 concludes with a brief summary and an outlook for future
work. A review of GRBMs, spGRBMs and mcRBMs is provided in the appendix, section 6, for
completeness. 

Related work Most similar in spirit to our work are [6, 20, 22]: Like us, [6, 22] are interested in
the behavior of unsupervised learning approaches without any supervised steps afterwards. Both,
however, investigate high-level representations. [20] learns a compact, binary representation with a
very deep autoencoder in order to do fast content-based image search (semantic hashing, [36]).
Again, these representations are studied with respect to their capabilities to model high-level object
concepts. Additionally, various algorithms to learn high-level correspondences have been studied
[4] [37] [16] in recent years.

Finding (compact) low-level image descriptors should be an excellent machine learning task: Even 
hand-designed descriptors have many free parameters that cannot (or should not) be optimized manually.
Given ground truth data for correspondences, the performance of supervised learning algorithms
is impressive [2]. Very recently, boosted learning with image gradient-based weak learners
has shown excellent results [43, 42] on the same dataset used in this paper. See section 2 of [43] for
more related work in the space of supervised metric learning.

2 Dataset 

At the heart of this paper is a recently introduced dataset for discriminative learning of local image 
descriptors [2]. It attempts to foster learning optimal low-level image representations using a large
and realistic training set of patch correspondences. The dataset is based on more than 1.5 million 
image patches (64 x 64 pixels) of three different scenes: the Statue of Liberty (about 450,000 
patches), Notre Dame (about 450,000 patches) and Yosemite's Half Dome (about 650,000 patches). 
The patches are sampled around interest points detected by Difference of Gaussians [27] and are
normalized with respect to scale and orientation¹. As shown in Figure 1, the dataset has a wide
variation in lighting conditions, viewpoints, and scales. 

The dataset also contains approximately 2.5 million image correspondences. Correspondences between
image patches are established via dense surface models obtained from stereo matching (stereo
matching, with its epipolar and multi-view constraints, is a much easier problem than unconstrained
2D feature matching). The exact procedure to establish correspondences is more involved and described
in detail in [2, Section II]. Because actual 3D correspondences are used, the identified 2D
patch correspondences show substantial perspective distortions, resulting in a much more realistic
dataset than previous approaches [24, 28]. The dataset appears very similar to an earlier benchmark
of the same authors [47], yet the correspondences in the novel dataset pose a much harder problem.
The error rate at 95% detection of correct matches for the SIFT descriptor [27] rises from 6%
to 26%, and the error rate for evaluating patch similarity in pixel space (using normalized sum squared
differences) rises from 20% to at least 48% (all numbers are taken from [47] and [2] respectively),
for example. In order to facilitate comparison of various descriptor algorithms a large set of predetermined
match/non-match patch pairs is provided. For every scene, sets comprising between 500
and 500,000 pairs (with 50% matching and 50% non-matching pairs) are available.

1 A similar dataset of patches centered on multi-scale Harris corners is also available.

Figure 1: Patch correspondences from the Liberty dataset. Note the wide variation in lighting,
viewpoint and level of detail. The patches are centered on interest points but otherwise can be
considered random, e.g. there is no reasonable notion of an object boundary possible. Figure taken
from [2].

We don't argue that this dataset subsumes or substitutes any of the previously mentioned bench- 
marks. Instead, we think that it can serve to complement those. It constitutes an excellent testbed 
for unsupervised learning algorithms: Experiments considering self-taught learning [32], effects of 
semi-supervised learning, supervised transfer learning over input distributions with a varying degree
of similarity (the scenes of Statue of Liberty and Notre Dame show architectural structures,
while Half Dome resembles a typical natural scenery) and the effect of enhancing the dataset with
arbitrary image patches around keypoints can all be conducted in a controlled environment. Furthermore,
end-to-end trained systems for (large) classification problems (like [21, 5]) can be evaluated
with respect to this type of data distribution and task. 



3 Training Setup 

In contrast to [2], our models are trained in an unsupervised fashion on the available patches. We train
on one scene (400,000 randomly selected patches from this scene) and evaluate the performance on
the test set of every scene. This allows us to investigate the self-taught learning paradigm [32]. We
also train on all three scenes jointly (represented by 1.2 million image patches) and then again evaluate
every scene individually.



3.1 GRBM/spGRBM 



The GRBM and spGRBM (see Appendix, section 6.2) differ only in the setting of the sparsity
penalty $\lambda_{sp}$; all other settings are the same. We use CD$_1$ [13] to compute the approximate gradient
of the log-likelihood and the recently proposed rmsprop [41] method as gradient ascent method.
Compared to standard minibatch gradient ascent, we find that rmsprop is a more efficient method
with respect to the training time necessary to learn good representations: it takes at most half of the
training time necessary for standard minibatch gradient ascent.

Before learning the parameters, we first scale all image patches to 16 x 16 pixels. Then we preprocess
all training samples by subtracting each vector's mean and dividing by the standard deviation of its
elements. This is a common practice for visual data and corresponds to local brightness and contrast
normalization. [39, Section 2.2] also gives a theoretical justification for why this preprocessing step
is necessary to learn a reasonable precision matrix $\Lambda$. We find that this is the only preprocessing
scheme that allows GRBM and spGRBM to achieve good results. In addition, it is important to learn
$\Lambda$: setting it to the identity matrix, a common practice [14], produces unsatisfactory error rates.
Note that originally it was considered that learning $\Lambda$ is mostly important when one wants to find a
good density (i.e. generative) model of the data.
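The per-patch normalization described above can be sketched in a few lines of NumPy. This is only an illustration: the function name `normalize_patches` and the small constant `eps` (a guard against constant patches) are our assumptions and not part of the described setup.

```python
import numpy as np

def normalize_patches(patches, eps=1e-8):
    """Local brightness and contrast normalization, one patch per row.

    patches: array of shape (n_patches, n_pixels), e.g. flattened 16x16 patches.
    eps: hypothetical guard against division by zero for constant patches.
    """
    patches = np.asarray(patches, dtype=np.float64)
    mean = patches.mean(axis=1, keepdims=True)   # brightness of each patch
    std = patches.std(axis=1, keepdims=True)     # contrast of each patch
    return (patches - mean) / (std + eps)

# Illustrative input: four random "patches" with 16 * 16 = 256 pixels each.
rng = np.random.default_rng(0)
demo_patches = rng.uniform(0, 255, size=(4, 256))
normalized = normalize_patches(demo_patches)
```

After this step every patch has (approximately) zero mean and unit standard deviation, independent of its original brightness and contrast.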

Both GRBM and spGRBM have 512 hidden units. The elements of $W$ are initialized according to
$\mathcal{N}(0, 0.1)$, the biases are initialized to 0. rmsprop uses a learning rate of 0.001, the decay factor is
0.9, the minibatch size is 128. We train both models for 10 epochs (this takes about 15 minutes on
a consumer GPU for 400,000 patches). For the spGRBM we use a sparsity target of $\rho = 0.05$ and a
sparsity penalty of $\lambda_{sp} = 0.2$. spGRBM is very sensitive to settings of $\lambda_{sp}$ [38]: setting it too high
results in dead representations (samples that have no active hidden units) and the results deteriorate
drastically.
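For readers unfamiliar with rmsprop, a minimal sketch of one update step with the learning rate and decay factor given above might look as follows. The function name `rmsprop_ascent_step`, the stabilising constant `eps` and the toy objective are our assumptions for illustration; the actual training code may differ in its details.

```python
import numpy as np

def rmsprop_ascent_step(param, grad, mean_sq, lr=0.001, decay=0.9, eps=1e-8):
    """One rmsprop step for gradient *ascent* (we maximise the log-likelihood).

    mean_sq: running average of squared gradients, same shape as param.
    eps: hypothetical stabiliser, not a value taken from the text.
    """
    mean_sq = decay * mean_sq + (1.0 - decay) * grad ** 2
    param = param + lr * grad / (np.sqrt(mean_sq) + eps)
    return param, mean_sq

# Toy objective f(w) = -0.5 * (w - 3)^2 with gradient (3 - w), maximised at w = 3.
w = np.zeros(1)
mean_sq = np.zeros(1)
trajectory = []
for _ in range(100):
    grad = 3.0 - w
    w, mean_sq = rmsprop_ascent_step(w, grad, mean_sq)
    trajectory.append(float(w[0]))
```

Because each gradient is divided by the root of its running mean square, the step size stays on the order of the learning rate regardless of the raw gradient magnitude.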



3.2 mcRBM 



mcRBM (see Appendix, section 6.3) training is performed using the code from [33]. We resample
the patches to 16 x 16 pixels. Then the samples are preprocessed by subtracting their mean
(patchwise), followed by PCA whitening, which retains 99% of the variance. The overall training
procedure (with stochastic gradient descent) is identical to the one described in [33, Section 4]. We
train all architectures for a total of 100 epochs; however, updating $P$ is only started after epoch
50. We consider two different mcRBM architectures: The first has 256 mean units, 512 factors
and 512 covariance units. $P$ is not constrained by any fixed topography. We denote this architecture
by mcRBM(256, 512/512). The second architecture is concerned with learning more compact
representations: It has 64 mean units, 576 factors and 64 covariance units. $P$ is initialized with a
two-dimensional topography that takes 5 x 5 neighborhoods of factors with a stride equal to 3. We
denote this model by mcRBM(64, 576/64). On a consumer-grade GPU it takes 6 hours to train the
first architecture on 400,000 samples and 4 hours to train the second architecture on the same number
of samples.
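The preprocessing for the mcRBM (patchwise mean removal followed by PCA whitening that retains 99% of the variance) could be sketched as below. This is not the preprocessing code of [33]; the function name `pca_whiten`, the additional centering over the dataset and the constant `eps` are our assumptions.

```python
import numpy as np

def pca_whiten(patches, retain=0.99, eps=1e-8):
    """Patchwise mean removal followed by PCA whitening.

    Returns the whitened data and the (n_pixels, k) whitening transform, where
    k is the smallest number of components explaining `retain` of the variance.
    """
    x = np.asarray(patches, dtype=np.float64)
    x = x - x.mean(axis=1, keepdims=True)    # patchwise mean removal (from the text)
    x = x - x.mean(axis=0, keepdims=True)    # centering over the dataset (assumption)
    cov = x.T @ x / x.shape[0]
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]        # sort in descending order
    eigvals = np.clip(eigvals[order], 0.0, None)
    eigvecs = eigvecs[:, order]
    cum = np.cumsum(eigvals) / eigvals.sum()
    k = int(np.searchsorted(cum, retain) + 1)
    transform = eigvecs[:, :k] / np.sqrt(eigvals[:k] + eps)
    return x @ transform, transform

# Illustrative data with unequal per-pixel variance.
rng = np.random.default_rng(0)
demo = rng.normal(size=(2000, 64)) * np.linspace(0.1, 2.0, 64)
whitened, transform = pca_whiten(demo)
whitened_cov = whitened.T @ whitened / whitened.shape[0]
```

The whitened data has (approximately) identity covariance and fewer dimensions than the input, since the directions with the least variance are discarded.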



4 Results 



For the results presented in this section (Table 1) we follow the evaluation procedure of [2]: For
every scene (Liberty (denoted by LY), Notredame (ND) and Half Dome (HD)), we use the labeled 
dataset with 100,000 image pairs to assess the quality of a trained model on this scene. In order to 
save space we do not present ROC curves and only show the results in terms of the 95% error rate 
which is the percent of incorrect matches when 95% of the true matches are found: After computing 
the respective distances for all pairs in a test set, a threshold is determined such that 95% of all 
matching pairs have a distance below this threshold. Non-matching pairs with a distance below this 
threshold are considered incorrect matches. 
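The thresholding procedure above can be made concrete with a short sketch. We assume here that the error rate is normalised by the number of non-matching pairs (i.e. a false positive rate at 95% recall), which is our reading of the procedure; the function name `error_rate_at_95` and the toy distances are hypothetical.

```python
import numpy as np

def error_rate_at_95(distances, is_match, recall=0.95):
    """Percent of non-matching pairs accepted when `recall` of the matches are found."""
    distances = np.asarray(distances, dtype=np.float64)
    is_match = np.asarray(is_match, dtype=bool)
    match_d = np.sort(distances[is_match])
    # Smallest threshold such that at least `recall` of the matching pairs lie at or below it.
    idx = int(np.ceil(recall * match_d.size)) - 1
    threshold = match_d[idx]
    # Non-matching pairs at or below the threshold count as incorrect matches.
    false_pos = np.sum(distances[~is_match] <= threshold)
    # Assumption: normalise by the number of non-matching pairs.
    return 100.0 * false_pos / np.sum(~is_match)

# Toy example: 20 matching pairs with distances 1..20 and 10 non-matching pairs.
demo_distances = np.concatenate([np.arange(1, 21), [5, 18.5, 25, 30, 40, 50, 60, 70, 80, 90]])
demo_labels = np.array([True] * 20 + [False] * 10)
demo_error = error_rate_at_95(demo_distances, demo_labels)
```

In the toy example two of the ten non-matching pairs fall below the threshold, so the sketch reports an error rate of 20%.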

Table 1 consists of two subtables. Table 1a presents the error rates for GRBM, spGRBM and
mcRBM when no limitations on the size of representations are placed. Table 1b only considers
descriptors that have an overall small memory footprint. For GRBM and spGRBM we use the
activations of the hidden units given a preprocessed input patch $v$ as descriptor $D(v)$ (see eq. 5,
section 6.1):

$$D(v) = \sigma(v^T \Lambda^{1/2} W + b)$$

For the mcRBM a descriptor is formed by using the activations of the latent covariance units alone²,
see eq. 8, section 6.3:

$$D(v) = \sigma(P^T (C^T v)^2 + c)$$

This is in accordance with manually designed descriptors. Many of these rely on distributions (i.e.
histograms) of intensity gradients or edge directions [27, 28, Q], structural information which is
encoded by the covariance units (see also [35, Section 2]).
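Both descriptor definitions translate directly into matrix operations. The following sketch assumes that patches are stored as rows and that the diagonal of the precision matrix is kept as a vector; the function names and the randomly drawn parameters are purely illustrative and stand in for trained models.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grbm_descriptor(v, W, b, lam_diag):
    """D(v) = sigmoid(v^T Lambda^{1/2} W + b), with Lambda diagonal.

    v: (n, N_v), W: (N_v, N_h), b: (N_h,), lam_diag: (N_v,) diagonal of Lambda.
    """
    return sigmoid((v * np.sqrt(lam_diag)) @ W + b)

def mcrbm_descriptor(v, C, P, c):
    """D(v) = sigmoid(P^T (C^T v)^2 + c), using the covariance units only.

    v: (n, N_v), C: (N_v, F), P: (F, N_h) with non-positive entries, c: (N_h,).
    """
    return sigmoid(((v @ C) ** 2) @ P + c)

# Random stand-ins for trained parameters (hypothetical sizes).
rng = np.random.default_rng(0)
n_v, n_f, n_h = 64, 32, 16
v = rng.normal(size=(5, n_v))
d_grbm = grbm_descriptor(v, 0.1 * rng.normal(size=(n_v, n_h)), np.zeros(n_h), np.ones(n_v))
C = 0.1 * rng.normal(size=(n_v, n_f))
P = -np.abs(0.1 * rng.normal(size=(n_f, n_h)))   # P has to be non-positive
d_mc = mcrbm_descriptor(v, C, P, np.zeros(n_h))
```

With a zero bias and non-positive $P$, the covariance unit activations in this toy setting cannot exceed 0.5, which illustrates that strong filter responses switch covariance units off.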



4.1 Distance metrics 



As we explicitly refrain from learning a suitable (with respect to the correspondence task) distance
metric with a supervised approach, we have to resort to standard distance measures. The Euclidean
distance is widely used when comparing image descriptors. Yet, considering the generative nature of
our models we follow the general argumentation of [17] and choose the Manhattan distance, denoted
in this text by $L_1$. We also consider two normalization schemes for patch representations, $\ell_1$ and $\ell_2$
(i.e. after a feature vector $x$ is computed, its length is normalized such that $\|x\|_1 = 1$ or $\|x\|_2 = 1$).

2 Extending the descriptor with mean units degrades results.

Table 1: Error rates, i.e. the percent of incorrect matches when 95% of the true matches are found.
All numbers for GRBM, spGRBM and mcRBMs are given within ±0.5%. Every subtable, indicated
by an entry in the Method column, denotes a descriptor algorithm. Descriptor algorithms that do not
require learning (denoted by - in the column Training set) are represented by one line. The numbers
in the columns labeled LY, ND and HD are the error rates of a method on the respective test set for
this scene. Supervised algorithms are not evaluated (denoted by -) on the scene they are trained on.
The Training set LY/ND/HD encompasses 1.2 million patches of all three scenes; this setting is only
possible for unsupervised learning methods. (a) Error rates for several unsupervised algorithms
without restricting the size of the learned representation. GRBM, spGRBM and mcRBM learn
descriptors of dimensionality 512. ($L_1/\ell_1$) denotes that the error rates for a method are with respect
to $\ell_1$ normalization of the descriptor under the $L_1$ distance. (b) Results for compact descriptors.
BRIEF (32 bytes) and BRISK (64 bytes) (53) are binary descriptors, SURF [1] is a real valued
descriptor with 64 dimensions. BinBoost [42], ITQ-SIFT [12] and D-Brief [44] learn compact
binary descriptors with supervision. Numbers for BRIEF, BRISK, SURF, BinBoost and ITQ-SIFT
are from [42].

Given a visible input both (sp)GRBM and mcRBM compute features that resemble parameters of 
(conditionally) independent Bernoulli random variables. Therefore we consider the Jensen-Shannon 
divergence (JSD) [26] as an alternative similarity measure. Finally, for binary descriptors, we use
the Hamming distance. 
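The distance measures and normalization schemes of this section can be sketched as follows. How exactly the Jensen-Shannon divergence is aggregated over units is not spelled out in the text; summing the per-unit divergences of independent Bernoulli variables is our assumption, as are all function names and the clipping constant `eps`.

```python
import numpy as np

def l1_normalize(x, eps=1e-12):
    return x / (np.abs(x).sum(axis=-1, keepdims=True) + eps)

def l2_normalize(x, eps=1e-12):
    return x / (np.sqrt((x ** 2).sum(axis=-1, keepdims=True)) + eps)

def l1_distance(x, y):
    """Manhattan distance between two descriptors."""
    return np.abs(x - y).sum(axis=-1)

def bernoulli_jsd(p, q, eps=1e-12):
    """JSD between two vectors of independent Bernoulli parameters.

    Assumption: per-unit divergences are summed over all units.
    """
    p = np.clip(p, eps, 1.0 - eps)
    q = np.clip(q, eps, 1.0 - eps)
    m = 0.5 * (p + q)

    def kl(a, b):
        # KL divergence between Bernoulli(a) and Bernoulli(b), elementwise.
        return a * np.log(a / b) + (1.0 - a) * np.log((1.0 - a) / (1.0 - b))

    return (0.5 * kl(p, m) + 0.5 * kl(q, m)).sum(axis=-1)

def hamming_distance(a, b):
    """Number of differing bits between two binary descriptors."""
    return np.count_nonzero(a != b, axis=-1)

p_demo = np.array([0.1, 0.9, 0.5])
q_demo = np.array([0.8, 0.2, 0.5])
jsd_demo = bernoulli_jsd(p_demo, q_demo)
```

Each per-unit divergence is bounded by log 2, so the summed JSD of two descriptors with n units lies between 0 and n log 2.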



4.2 SIFT Baseline 

SIFT [27] (both as interest point detector and descriptor) was a landmark for image feature matching.
Because of its good performance it is one of the most important basic ingredients for many different 
kinds of Computer Vision algorithms. It serves as a baseline for evaluating our models. We use 
vlfeat [45] to compute the SIFT descriptors. 
The performance of the SIFT descriptor, $\ell_1$-normalized, is reported (using $L_1$ distance) in Table 1a,
first entry. $\ell_1$ normalization provides better results than $\ell_2$ normalization or no normalization at all.
SIFT performs descriptor sampling at a certain scale relative to the Difference of Gaussians peak.
In order to achieve good results, it is essential to optimize this scale parameter [2, Figure 6] on
every dataset. Table 1b is concerned with evaluating compact descriptors: the first entry shows the
performance of SIFT when used as a 128-byte descriptor (i.e. no normalization applied, but again
optimized for the best scale parameter) with $L_1$ distance.



4.3 Quantitative analysis 

Table 1a shows that SIFT performs better than all three unsupervised methods. mcRBM(256,
512/512) performs similarly to SIFT when trained on Half Dome, albeit at the cost of a 4.5 times
larger descriptor representation. The compact binary descriptor (the simple binarization scheme
is described below) based on mcRBM(64, 576/64) performs remarkably well, comparable to or even
better than several state-of-the-art descriptors (either manually designed or trained in a supervised
manner), see Table 1b, last entry. We discuss in more detail several aspects of the results in the
following paragraphs. 



GRBM and spGRBM spGRBM performs considerably better than its non-sparse version (see
Table 1a, second and third entries). This is not necessarily expected: Unlike e.g. in classification
[8], sparse representations are considered problematic with respect to evaluating distances directly.
Lifetime sparsity may after all be beneficial in this setting compared to strictly enforced population
sparsity. We plan to investigate this issue in more detail in future work by comparing spGRBM to
cardinality-restricted Boltzmann machines [38] on this dataset.



Self-taught paradigm We would expect that the performance of a model trained on the Liberty 
dataset and evaluated on the Notre Dame scene (and vice versa) should be noticeably better than 
the performance of a model trained on Half Dome and evaluated on the two architectural datasets. 
However, this is not what we observe. In particular for the mcRBM (both architectures) it is the 
opposite: Training on the natural scene data leads to much better performance than the assumed 
optimal setting. 



Jensen-Shannon Divergence Both GRBM and spGRBM perform poorly under the Jensen-Shannon
divergence similarity (overall error rates are around 60%), therefore we don't report these
numbers in the table. Results for mcRBM under JSD are equally bad. However, if one scales
down $P$ by a constant (we found the value of 3 appropriate), the results with respect to JSD improve
noticeably, see Table 1a, the last entry. The performance on the Half Dome dataset is still not good;
the scaling factor should be learned [9], which we also plan for future work.



Compact binary descriptor We were not successful in finding a good compact representation
with either GRBM or spGRBM. Finding compact representations for any kind of input
data should be done with multiple layers of nonlinearities [20]. But even with only two layers
(mcRBM(64, 576/64)) we learn relatively good compact descriptors. If features are binarized, the
representation can be made even more compact (64 bits, i.e. 8 bytes). In order to find a suitable
binarization threshold we employ the following simple heuristic: After training on a dataset is finished
we histogram all activations (values between 0 and 1) of the training set and use the median
of this histogram as the threshold.
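The binarization heuristic can be illustrated with the following sketch. Instead of building an explicit histogram we compute the median of all training activations directly, which we assume is equivalent up to binning; the function names and the random stand-in activations are hypothetical.

```python
import numpy as np

def median_threshold(train_activations):
    """Single global threshold: the median over all activations of the training set."""
    return float(np.median(train_activations))

def binarize(activations, threshold):
    """Turn activations in [0, 1] into bits, one bit per covariance unit."""
    return (activations > threshold).astype(np.uint8)

def pack_bits(bits):
    """Pack 64 bits per descriptor into 8 bytes."""
    return np.packbits(bits, axis=-1)

def hamming_packed(a, b):
    """Hamming distance between two packed binary descriptors."""
    return int(np.unpackbits(np.bitwise_xor(a, b)).sum())

# Stand-in for the 64 covariance unit activations of 100 training patches.
rng = np.random.default_rng(0)
demo_activations = rng.random((100, 64))
threshold = median_threshold(demo_activations)
bits = binarize(demo_activations, threshold)
packed = pack_bits(bits)
```

Using the median as the threshold means that, over the training set, roughly half of all bits are set, so every bit position can in principle carry information.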



4.4 Qualitative analysis 

We briefly comment on the developed filters (Figure 2). Unsurprisingly, spGRBM (Figure 2a) and
mcRBM (Figure 2b; these are columns from $C$) learn Gabor-like filters. At a closer look we make
some interesting observations: Figure 2c shows the diagonal elements of $\Lambda^{1/2}$ from a spGRBM.
When computing a latent representation, the input $v$ is scaled (elementwise) by this matrix, which,
visualized as a 2D image, resembles a Gaussian that is dented at the center, the location of the
keypoint of every image patch. The mcRBM also builds filters around the keypoint: Figure 2d shows
some unusual filters from $C$. They are centered around the keypoint and bear a strong resemblance
to discriminative projections (Figure 2e) that are learned in a supervised way on this dataset [2,
Figure 5]. Qualitatively, the filters in Figure 2d resemble log-polar filters that are used in several
state-of-the-art feature designs [28]. The very focused keypoint filters (first column in Figure 2d)
are often combined with Gabor filters placed in the vicinity of the center; the Gabor filters appear
on their own if they are too far from the center. If an mcRBM is trained with a fixed topography for
$P$, one sees that the Gabor filters get systematically arranged around the keypoint (Figure 2f).

Figure 2: (a) Typical filters learned with spGRBM. (b) Filters from an mcRBM. (c) The pixelwise
inverted standard deviations learned with a spGRBM plotted as a 2D image (darker gray intensities
represent lower numerical values). An input patch is elementwise multiplied with this image when
computing the latent representation. This figure is generated by training on 32 x 32 patches for
better visibility, but the same qualitative results appear with 16 x 16 patches. (d) The mcRBM also
learns some variants of log-polar filters centered around the DoG keypoint. These are very similar
to filters found when optimizing for the correspondence problem in a supervised setting. Several
such filters are shown in subfigure (e), taken from [2, Figure 5]. Finally (f), the basic keypoint
filters are combined with Gabor filters if these are placed close to the center; the Gabor filters get
systematically arranged around the keypoint filters.

4.5 Other models 

We also trained several other unsupervised feature learning models: GRBM with nonlinear rectified
hidden units³ [30], various kinds of autoencoders (sparse Q and denoising [46] autoencoders), K-means
[7] and two layer models (stacked RBMs, autoencoders with two hidden layers, cRBM [34]).
None of these models performed as well as the spGRBM.

3 Our experiments indicate that rmsprop is in this case also beneficial with respect to the final results: It
learns models that perform about 2-3% better than those trained with stochastic gradient descent.

5 Conclusion 

We started this paper by suggesting that unsupervised feature learning should be evaluated (i) without using
subsequent supervised algorithms and (ii) more directly with respect to its capacity to find good
low-level image descriptors. A recently introduced dataset for discriminatively learning low-level
local image descriptors is then proposed as a suitable benchmark for such an evaluation scheme that
nicely complements the existing benchmarks. We demonstrate that an mcRBM learns real-valued
and binary descriptors that perform comparably to or even better than several state-of-the-art methods on
this dataset.

In future work we plan to evaluate deeper architectures [20], combined with sparse convolutional
features [18], on this dataset. Moreover, ongoing work investigates several algorithms [4] [37] for
supervised correspondence learning on the presented dataset.
6 Appendix



6.1 Gaussian-Binary Restricted Boltzmann Machine 

The Gaussian-Binary Restricted Boltzmann Machine (GRBM) is an extension of the Binary-Binary
RBM [11] that can handle continuous data [15, 39]. It is a bipartite Markov Random Field over a
set of visible units, $v \in \mathbb{R}^{N_v}$, and a set of hidden units, $h \in \{0, 1\}^{N_h}$. Every configuration of units
$v$ and units $h$ is associated with an energy $E(v, h)$, defined as

$$E(v, h; \theta) = \frac{1}{2} v^T \Lambda v - v^T \Lambda a - h^T b - v^T \Lambda^{1/2} W h \qquad (1)$$

with $\theta = (W \in \mathbb{R}^{N_v \times N_h}, a \in \mathbb{R}^{N_v}, b \in \mathbb{R}^{N_h}, \Lambda \in \mathbb{R}^{N_v \times N_v})$, the model parameters. $W$ represents
the visible-to-hidden symmetric interaction terms, $a$ and $b$ represent the visible and hidden
biases respectively and $\Lambda$ is the precision matrix of $v$, taken to be diagonal. $E(v, h)$ induces a
probability density function over $v$ and $h$:

$$p(v, h; \theta) = \frac{\exp(-E(v, h; \theta))}{Z(\theta)} \qquad (2)$$

where $Z(\theta)$ is the normalizing partition function, $Z(\theta) = \int \sum_h \exp(-E(v, h; \theta)) \, dv$.

Learning the parameters $\theta$ is accomplished by gradient ascent in the log-likelihood of $\theta$ given $N$
i.i.d. training samples. The log-probability of one training sample is

$$\log p(v) = -\frac{1}{2} v^T \Lambda v + v^T \Lambda a + \sum_{j=1}^{N_h} \log\left(1 + \exp\left(\sum_{i=1}^{N_v} v_i (\Lambda^{1/2} W)_{ij} + b_j\right)\right) - \log Z(\theta) \qquad (3)$$

Evaluating $Z(\theta)$ is intractable, therefore algorithms like Contrastive Divergence (CD) [13] or persistent
CD (PCD) [40] are used to compute an approximation of the log-likelihood gradient. The
bipartite nature of a (G)RBM is an important aspect when using these algorithms: The visible units
are conditionally independent given the hidden units. They are distributed according to a diagonal
Gaussian:

$$p(v \mid h) \sim \mathcal{N}(\Lambda^{-1/2} W h + a, \Lambda^{-1}) \qquad (4)$$

Similarly, the hidden units are conditionally independent given the visible units. The conditional
distribution can be written compactly as

$$p(h \mid v) = \sigma(v^T \Lambda^{1/2} W + b) \qquad (5)$$

where $\sigma$ denotes the element-wise logistic sigmoid function, $\sigma(z) = 1/(1 + e^{-z})$.
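The two conditional distributions are what makes block Gibbs sampling (and hence CD) cheap. A minimal sketch of the two sampling steps, assuming a diagonal precision matrix stored as a vector and patches stored as rows, is given below; the function names and all parameter values are illustrative only, and we treat 0.1 as a standard deviation when drawing the toy weights, which the text does not specify.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sample_h_given_v(v, W, b, lam_diag, rng):
    """p(h | v) = sigmoid(v^T Lambda^{1/2} W + b); returns probabilities and binary samples."""
    p = sigmoid((v * np.sqrt(lam_diag)) @ W + b)
    return p, (rng.random(p.shape) < p).astype(np.float64)

def sample_v_given_h(h, W, a, lam_diag, rng):
    """p(v | h) = N(Lambda^{-1/2} W h + a, Lambda^{-1}) for diagonal Lambda."""
    mean = (h @ W.T) / np.sqrt(lam_diag) + a
    return mean + rng.standard_normal(mean.shape) / np.sqrt(lam_diag)

# Toy parameters (hypothetical sizes and values).
rng = np.random.default_rng(0)
n_v, n_h = 8, 4
W = rng.normal(0.0, 0.1, size=(n_v, n_h))
a = np.full(n_v, 0.5)
b = np.zeros(n_h)
lam_diag = np.full(n_v, 4.0)

v0 = rng.normal(size=(5000, n_v))
p_h, h0 = sample_h_given_v(v0, W, b, lam_diag, rng)
# With all hidden units off, the visible units are Gaussian around the bias a.
v1 = sample_v_given_h(np.zeros_like(h0), W, a, lam_diag, rng)
```

One CD$_1$ step would chain these two functions (data to hidden sample to reconstruction to hidden probabilities) and use the difference of the resulting statistics as the approximate gradient.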



6.2 Sparse GRBM 

In many tasks it is beneficial to have features that are only rarely active [29, 8]. Sparse activation of a
binary hidden unit can be achieved by specifying a sparsity target $\rho$ and adding an additional penalty
term to the log-likelihood objective that encourages the actual probability of unit $j$ being active,
$q_j$, to be close to $\rho$ [29, 14]. This penalty is proportional to the negative KL divergence between the
target sparsity and the hidden unit marginal $q_j = \frac{1}{N} \sum_n p(h_j = 1 \mid v_n)$:

$$\lambda_{sp} (\rho \log q_j + (1 - \rho) \log(1 - q_j)), \qquad (6)$$

where $\lambda_{sp}$ represents the strength of the penalty. This term enforces sparsity of feature $j$ over the
training set, also referred to as lifetime sparsity. The hope is that the features for one training sample
are then encoded by a sparse vector, corresponding to population sparsity. We denote a GRBM with
a sparsity penalty $\lambda_{sp} > 0$ as spGRBM.
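The penalty term of eq. 6 is easy to evaluate for a batch of hidden unit probabilities. In the sketch below we assume that $q_j$ is estimated as the mean activation probability of unit $j$ over the batch; the function name `sparsity_penalty`, the clipping constant `eps` and the toy values are ours.

```python
import numpy as np

def sparsity_penalty(hidden_probs, rho=0.05, lam_sp=0.2, eps=1e-12):
    """Per-unit penalty lam_sp * (rho * log q_j + (1 - rho) * log(1 - q_j)).

    hidden_probs: (n_samples, N_h) array of p(h_j = 1 | v_n).
    q_j is taken as the batch mean of these probabilities (assumption).
    """
    q = np.clip(hidden_probs.mean(axis=0), eps, 1.0 - eps)
    return lam_sp * (rho * np.log(q) + (1.0 - rho) * np.log(1.0 - q))

# Three hypothetical units: on target (0.05), far too active (0.5), almost dead (0.001).
probs_demo = np.column_stack([np.full(10, 0.05), np.full(10, 0.5), np.full(10, 0.001)])
penalty_demo = sparsity_penalty(probs_demo)
```

Since the term is added to an objective that is maximised, it is largest (least negative) for the unit whose mean activation matches the target $\rho$, and it decreases for units that are either too active or almost never active.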
6.3 Mean-Covariance Restricted Boltzmann Machine



In order to model pairwise dependencies of visible units gated by hidden units, a third-order RBM
can be defined with a weight $W_{ijk}$ for each triplet $v_i, v_j, h_k$. By factorizing and tying these weights,
parameters can be reduced to a filter matrix $C \in \mathbb{R}^{N_v \times F}$ and a pooling matrix $P \in \mathbb{R}^{F \times N_h}$. $C$
connects the input to a set of factors and $P$ maps factors to hidden variables. The energy function
for this cRBM [34] is

$$E^c(v, h^c; \theta) = -(v^T C)^2 P h^c - c^T h^c \qquad (7)$$

where $(\cdot)^2$ denotes the element-wise square operation and $\theta = \{C, P, c\}$. Note that $P$ has to be
non-positive [34, Section 5]. The hidden units of the cRBM are still conditionally independent given
the visible units, so inference remains simple. Their conditional distribution (given visible state $v$)
is

$$p(h^c \mid v) = \sigma(P^T (C^T v)^2 + c) \qquad (8)$$

The visible units are coupled in a Markov Random Field determined by the setting of the hidden
units:

$$p(v \mid h^c) \sim \mathcal{N}(0, \Sigma) \qquad (9)$$

with

$$\Sigma^{-1} = C \, \mathrm{diag}(-P h^c) \, C^T \qquad (10)$$

As equation 9 shows, the cRBM can only model Gaussian inputs with zero mean. For general
Gaussian-distributed inputs the cRBM and the GRBM can be combined into the mean-covariance
RBM (mcRBM) by simply adding their respective energy functions:

$$E^{mc}(v, h^m, h^c; \theta, \theta') = E^m(v, h^m; \theta) + E^c(v, h^c; \theta') \qquad (11)$$

$E^m(v, h^m; \theta)$ denotes the energy function of the GRBM (see eq. 1) with $\Lambda$ fixed to the identity
matrix. The resulting conditional distribution over the visible units, given the two sets of hidden
units $h^m$ (mean units) and $h^c$ (covariance units) is

$$p(v \mid h^m, h^c) \sim \mathcal{N}(\Sigma W h^m, \Sigma) \qquad (12)$$

with $\Sigma$ defined as in eq. 10. The conditional distributions $p(h^m \mid v)$ and $p(h^c \mid v)$ are still as in eq. 5
and eq. 8 respectively. The parameters $\theta, \theta'$ can again be learned using approximate Maximum
Likelihood Estimation, e.g. via CD or PCD. These methods require sampling from $p(v \mid h^m, h^c)$,
which involves an expensive matrix inversion (see eq. 10). Instead, samples are obtained by using
Hybrid Monte Carlo (HMC) [31] on the mcRBM free energy [33].
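To illustrate why $P$ has to be non-positive and why direct sampling is expensive, the precision matrix of eq. 10 can be built explicitly for a toy model. The sizes, the random parameters and the function name `crbm_precision` are hypothetical; real models have far larger input dimensions, which is what makes the inversion costly.

```python
import numpy as np

def crbm_precision(C, P, h_c):
    """Sigma^{-1} = C diag(-P h^c) C^T for a given covariance unit state h^c."""
    return C @ np.diag(-(P @ h_c)) @ C.T

# Toy model: 6 visible units, 10 factors, 4 covariance units.
rng = np.random.default_rng(0)
C = rng.normal(size=(6, 10))
P = -np.abs(rng.normal(size=(10, 4)))     # non-positive pooling matrix
h_c = np.array([1.0, 0.0, 1.0, 1.0])      # one configuration of the covariance units

precision = crbm_precision(C, P, h_c)
sigma = np.linalg.inv(precision)          # the inversion needed for direct sampling
min_eig = np.linalg.eigvalsh(precision).min()
```

With non-positive $P$ every diagonal weight $-P h^c$ is non-negative, so the precision matrix is positive semi-definite and (when it has full rank, as in this toy example) defines a valid Gaussian. Since the matrix changes with every hidden configuration, it would have to be inverted anew for each sample.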